YAITSASUG

Yet Another Image Transformation Service At Scale Using Golang

dani(dot)caba at gmail(dot)com

Who are you?

dcProfile := map[string]interface{}{
  "name": "Daniel Caballero",
  "title": "Staff DevOps Engineer",
  "mail": "dani(dot)caba at gmail(dot)com",
  "company": &SchibstedPT,
  "previously_at": []*company{&NTTEurope, &Semantix, &Oracle},
  "linkedin": "https://www.linkedin.com/in/danicaba",
  "extra": "Gestión DevOps de Arquitecturas IT@LaSalle",
}

So... I work

... I (some kinda) teach

... I (try to) program...

... I (would like to) rock...

... and I live

So... I value my time (a lot)

And I really don't like to waste it resolving incidents

Schibsgrñvahed..WHAT??

What is Schibsted?

Origin - Media houses

Marketplaces global expansion

Large group of companies

And SPT?

And SPT Platform Services?

It's about a developer experience...

{
    "format": "jpg",
    "watermark": {
        "location": "north",
        "margin": "20px",
        "dimension": "20%"
    },
    "actions": [
        {
            "resize": {
                "width": 300,
                "fit": {
                    "type": "clip"
                }
            }
        }
    ],
    "quality": 90
}

The journey

2½ years ago

First onboardings

Onboarding pipelines

First nightmares

First (quite manual) release process

New Architecture

New Core

Self-service capabilities

Updated onboarding pipelines

Current usage

(Your?) thoughts so far...

Why build your own service?

But there are already open-source HTTP servers for that, right?

Why not offline transformations?

Why microservices?

+

  • Quicker releases
  • An API gateway helps to delegate common functionality
    • But only business-agnostic pieces
  • Reusability of individual microservices
  • Each microservice can choose different technologies
    • We will focus on delivery-images, written in Golang
  • Easier to scale with the organization/development team
    • Though we are not taking full advantage of this yet
  • and... fun

-

  • S2S communication overhead
  • Extra costs
  • More tooling required (logging, tracing...)
  • Reproducing the complete environment becomes tricky
  • Always caring about coupled services...

Why not CDN/edge transformations?

  • Some functionality may be covered...
    • Typically resizing and format conversion
    • But not all our functionality (watermarking?)
  • It may mean duplicated processing
  • Not easy to pack something like libvips as lambdas
  • No unique & global CDN in Schibsted

Not a new story... so why didn't we present it before?

Why transformations in golang?

Transformation library

  • imageflow was not production-ready two years ago, with clear gaps in functionality and bindings

Choosing the programming language

Platform (& development) properties

IaC

  • Most of the services run in AWS...
  • Generating CloudFormation templates with Python's troposphere
  • Managing CloudFormation deployments with Sceptre
  • New projects keep the infrastructure definition in the same repo as the service code
    • Trying to extend CD to infrastructure
  • We have assessed AWS GoFormation
    • But it still lacks some functionality, like GetAtt or Ref

Code reviews

reviewersRaffle:
  strategies:
    - team-with-knowledge-candidates:
        size: 1
        type: knowledge
        participants:
          teams:
            - spt-infrastructure/edge-team
    - team-random-candidates:
        type: sequential
        size: 2
        participants:
          teams:
            - spt-infrastructure/edge-team
  dailyReminder: enabled
slack:
  - "#spt-edge-prs"

Other bots

Continuous integration and delivery

Travis

language: go
go:
- 1.9.3

script:
- diff -u <(echo -n) <(gofmt -s -d $(find . -type f -name '*.go' -not -path "./vendor/*"))
- docker login -u="$ARTIFACTORY_USER" -p="$ARTIFACTORY_PASSWORD" containers.schibsted.io
- "./requirements/start-requirements.sh -d"
- "_script/tests-docker"
- "_script/compile-docker"
- "_script/cibuild"

deploy:
  skip_cleanup: true
  on:
    all_branches: true
  provider: script
  script: _script/deploy

FPM

fpm -s dir \
    -t rpm \
    -n ${PACKAGE_NAME}${DEV} \
    -v ${VERSION} \
        --iteration ${ITERATION} \
    --description "Yams delivery images. Commit: ${GIT_COMMIT_ID}" \
    --before-install ${TRAVIS_BUILD_DIR}/_pkg/stopservice \
    --after-install ${TRAVIS_BUILD_DIR}/_pkg/postinst \
    --before-remove ${TRAVIS_BUILD_DIR}/_pkg/stopservice \
    --depends datadog-config \
    --depends sumologic-config \
    ${DIST_PATH}/${PACKAGE_NAME}/=/

Hardened images

Spinnaker

Acceptance & Stress testing

Locust

Vegeta

Configuration management

  • Using Netflix Archaius in all rxJava services
    • Configured with dynamo tables...
    • so dynamic reconfiguration is possible...
    • and quite useful when dealing with outages :)
  • ... but Viper in delivery-images
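The dynamic-reconfiguration idea (what Archaius gives the rxJava services) can be sketched in plain Go with an atomically swapped config snapshot; the type and field names below are illustrative, not the service's actual ones:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Config is a hypothetical immutable configuration snapshot.
type Config struct {
	CacheTTLSeconds int
	Degraded        bool
}

// current holds a *Config; readers never block writers.
var current atomic.Value

// Load returns the active configuration snapshot.
func Load() *Config { return current.Load().(*Config) }

// Store swaps in a new configuration atomically, e.g. after a
// watcher noticed a change in a Dynamo table or a config file.
func Store(c *Config) { current.Store(c) }

func main() {
	Store(&Config{CacheTTLSeconds: 300})
	// During an outage, flip a flag without restarting the service.
	Store(&Config{CacheTTLSeconds: 300, Degraded: true})
	fmt.Println(Load().Degraded)
}
```

Each `Store` publishes a complete new snapshot, so readers never observe a half-updated configuration.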

Logs

  • Sumologic
    • Now quite happy... but costly (logging 100 GB per day)
    • Daemon/sidecar with specific config (which files to forward)

Using logrus in delivery-images

  • Disabling locking
  • Time/Date formatting equivalent to the logs in Java
  • Zap could be used...
    • but we are not concerned about logging overhead, as...
    • transforming images is quite resource-intensive anyway

Monitoring and alerting

  • We use Datadog
    • System + Custom application metrics
      • Via statsd
    • Importing also Cloudwatch metrics
  • Extensive usage of:
    • Dashboards (troubleshooting and also KPIs)
    • Alerting

  pre:
    not_allowed_notify_to:
    - "@webhook-alert-gateway-sev3"
    - "@webhook-alert-gateway-sev2"
    - "@pagerduty"
    healthy_host_count_critical: 0.0
    healthy_host_count_warning: 0.5
  pro:
    not_allowed_notify_to:
    healthy_host_count_critical: 1.0
    healthy_host_count_warning: 2.0

monitors:
  - name: "[ALB] - {{name.name}} in region {{region.name}} - 5xx backend error rate"
    multi: false
    tags:
      - "app:yams"
    type: "metric alert"
    options:
      notify_audit: false
      timeout_h: 0
      require_full_window: false
      thresholds:
        warning: 0.05
        critical: 0.1
      notify_no_data: false
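Under the hood, statsd clients just fire UDP datagrams at the local agent. A toy DogStatsD-style counter emitter (not the actual client library we use) shows the wire format:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// Encode renders a counter increment in DogStatsD wire format:
// "name:1|c|#tag1,tag2".
func Encode(name string, tags []string) string {
	msg := name + ":1|c"
	if len(tags) > 0 {
		msg += "|#" + strings.Join(tags, ",")
	}
	return msg
}

// Incr sends the datagram to the agent; UDP losses are acceptable
// for metrics, so errors are best-effort.
func Incr(conn net.Conn, name string, tags []string) error {
	_, err := conn.Write([]byte(Encode(name, tags)))
	return err
}

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8125") // default agent port
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	_ = Incr(conn, "transformation.cache.hit", []string{"app:yams"})
	fmt.Println(Encode("transformation.cache.hit", []string{"app:yams"}))
}
```

Because it is fire-and-forget UDP, instrumenting the hot path costs almost nothing even under load.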

Pagerduty onCall escalations

Distributed tracing

Integration

func InitializeTracer() (opentracing.Tracer, io.Closer) {
        cfg := GetConfig()
        if !cfg.Enabled {
                return nil, nil
        }

        // Configure HTTP propagation (using zipkin headers)
        zipkinPropagator := zipkin.NewZipkinB3HTTPHeaderPropagator()
        injector := jaeger.TracerOptions.Injector(opentracing.HTTPHeaders, zipkinPropagator)
        extractor := jaeger.TracerOptions.Extractor(opentracing.HTTPHeaders, zipkinPropagator)

        // Zipkin shares span ID between client and server spans; it must be enabled via the following option.
        zipkinSharedRPCSpan := jaeger.TracerOptions.ZipkinSharedRPCSpan(true)

        // Zipkin Reporter (http transport)
        transport := NewHTTPTransport(cfg.Host, cfg.Port, HTTPBasicAuth(cfg.ApiId, cfg.ApiKey))
        reporter := jaeger.NewRemoteReporter(transport)

        // Probabilistic Sampler
        sampler, _ := jaeger.NewProbabilisticSampler(cfg.Rate)

        // Create tracer
        tracer, closer := jaeger.NewTracer(
                "yams-delivery-images",
                sampler,
                reporter,
                injector,
                extractor,
                zipkinSharedRPCSpan,
        )

        opentracing.SetGlobalTracer(tracer)
        return tracer, closer
}
if transformedImage.FromCache {
        b.Logger.WithField("id", requestId).Debug("Found Transformation in cache")
        tracing.AddLogToSpanInContext(request.Context(), "Got image from cache")
        b.Monitor.Incr("transformation.cache.hit", tags, 1)
} else {
        b.Monitor.Incr("transformation.cache.miss", tags, 1)
        transformationElapsed := time.Since(transformationStart)
        transformationElapsedInMillis := float64(transformationElapsed.Nanoseconds()) / 1e6 // ns → ms
        b.Monitor.TimeInMilliseconds("request.transformation.duration", transformationElapsedInMillis, tags, 1)
        b.Monitor.Gauge("request.transformation.duration", transformationElapsedInMillis, tags, 1)
        b.Logger.WithField("id", requestId).Debug("Transformation took: ", transformationElapsed.Seconds(), " secs")
        tracing.AddLogKvToSpanInContext(request.Context(), "transformation.duration", transformationElapsed.String())
}

Real time monitoring

S2S resiliency

Secrets management

delivery-images implementation details

HTTP router

Services communication

  • fargo
  • eureka
  • load balancing
  • hystrix

bi-image + libvips

Caching

Zipkin

Custom metrics

Gifs..

And rate limits..

go-kit / negroni logrus logging

Integration tests execution

Datastore access

Graceful shutdowns

Autoscaling

AWS SDK usage

Stresstests

  • Locust -> Vegeta

Why not terraform?

Why not docker/k8s?

  • Portal
  • Migration exercise
  • Local tests

gRPC?

Why not Service Meshes?

Why not Google Cloud?

And Cassandra?

And PaaS?

And prometheus?

Actual future

Multiregion

Smoke tests

More elasticity to reduce costs

Extra compression

Currently libjpeg-turbo

Bringing the service closer to the business

More engines

Also in use for attachments and CVs, so PDF conversion makes sense (benchmarks already done). Video is another candidate.

Actual transformation pipelines

Include current workflow

More adoption?

Better capacity management

Incoming queue, reusing the cache when there is no spare capacity. Better degradation, combined with more efficient ASG triggers.

ApiGW replacement?

Zuul could be replaced by KrakenD

Simulating dependencies failures

Before closing...

Are you going to opensource it?

Are you going to sell it?

Price calculator

  • Share price comparison

Other projects

Choosing the right regions

Classifier end to end tests

Corollary

Keep Rx in the code...

Great thanks...

  • Sch*
  • Edge colleagues

Other Qs?